02 Supervised Machine Learning Regression
Load the absolute path to the data directory.
1 Libraries
Load the following libraries.
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
#> ✔ purrr 1.0.2
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#> Loading required package: PerformanceAnalytics
#> Loading required package: xts
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#>
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
#>
#>
#> ######################### Warning from 'xts' package ##########################
#> # #
#> # The dplyr lag() function breaks how base R's lag() function is supposed to #
#> # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
#> # source() into this session won't work correctly. #
#> # #
#> # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
#> # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
#> # dplyr from breaking base R's lag() function. #
#> # #
#> # Code in packages is not affected. It's protected by R's namespace mechanism #
#> # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
#> # #
#> ###############################################################################
#>
#> Attaching package: 'xts'
#>
#> The following objects are masked from 'package:dplyr':
#>
#> first, last
#>
#>
#> Attaching package: 'PerformanceAnalytics'
#>
#> The following object is masked from 'package:graphics':
#>
#> legend
#>
#> Loading required package: quantmod
#> Loading required package: TTR
#> Registered S3 method overwritten by 'quantmod':
#> method from
#> as.zoo.data.frame zoo
2 Data
# STOCK PRICES
sp_500_prices_tbl <- read_rds("C:/Users/MaximilianMuza/Documents/Projects/ss24-bdml-MaximilianMuza/data/sp_500_prices_tbl.rds")
sp_500_prices_tbl# SECTOR INFORMATION
sp_500_index_tbl <- read_rds(file.path(data_dir, "sp_500_index_tbl.rds"))
sp_500_index_tbl2.1 Stock Prices Standardization
Stock prices (adjusted stock price) are standardized by converting them into daily returns (percent change from previous day). This is done such that the stock prices are of the same magnitude and can thus be compared. Below is the sp 500 price table shown:
#> Rows: 1,225,765
#> Columns: 8
#> $ symbol <chr> "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT…
#> $ date <date> 2009-01-02, 2009-01-05, 2009-01-06, 2009-01-07, 2009-01-08, …
#> $ open <dbl> 19.53, 20.20, 20.75, 20.19, 19.63, 20.17, 19.71, 19.52, 19.53…
#> $ high <dbl> 20.40, 20.67, 21.00, 20.29, 20.19, 20.30, 19.79, 19.99, 19.68…
#> $ low <dbl> 19.37, 20.06, 20.61, 19.48, 19.55, 19.41, 19.30, 19.52, 19.01…
#> $ close <dbl> 20.33, 20.52, 20.76, 19.51, 20.12, 19.52, 19.47, 19.82, 19.09…
#> $ volume <dbl> 50084000, 61475200, 58083400, 72709900, 70255400, 49815300, 5…
#> $ adjusted <dbl> 15.86624, 16.01451, 16.20183, 15.22628, 15.70234, 15.23408, 1…
sp_500_daily_returns_tbl <- sp_500_prices_tbl %>%
select(symbol, date, adjusted) %>%
filter(date >= as.Date("2018-01-01")) %>%
group_by(symbol) %>%
mutate(adjusted_lag = lag(adjusted)) %>%
filter(!is.na(adjusted_lag)) %>%
mutate(difference = adjusted - adjusted_lag) %>%
mutate(pct_return = difference / adjusted_lag) %>%
select(symbol, date, pct_return) %>%
ungroup()
sp_500_daily_returns_tbl2.2 Conversion to User-Item Format
The next step involves converting to a user-item format with the symbol in the first column and every other column the value of the daily returns (pct_return) for every stock at each date. The user in this case is the symbol (company), and the item in this case is the pct_return at each date.
Importing the correct results first (just in case I was not able to complete the last step).
sp_500_daily_returns_tbl <- read_rds(file.path(data_dir, "sp_500_daily_returns_tbl.rds"))
sp_500_daily_returns_tblAnd the conversion follows with:
2.3 K-Means Clustering
Importing the correct results first (just in case I was not able to complete the last step).
stock_date_matrix_tbl <- read_rds(file.path(data_dir, "stock_date_matrix_tbl.rds"))
stock_date_matrix_tblAnd then executing the KMeans operation:
# Create kmeans_obj for 4 centers
NUM_CENTERS <- 4
N_START = 20
kmeans_obj <- stock_date_matrix_tbl %>%
select(-symbol) %>%
kmeans(centers = NUM_CENTERS, nstart = N_START)
kmeans_obj$cluster#> [1] 3 2 3 2 3 3 2 3 3 2 2 3 3 3 2 4 4 4 3 3 3 4 3 3 2 3 2 3 3 3 2 2 2 3 3 3 3
#> [38] 4 2 2 2 3 3 3 1 1 3 3 3 4 3 4 2 4 2 3 4 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3
#> [75] 3 3 3 4 3 4 3 3 3 3 3 3 4 2 3 3 3 3 3 4 3 3 3 3 4 4 3 3 3 3 3 4 3 4 3 3 3
#> [112] 1 3 3 4 3 3 2 2 3 3 3 3 3 3 1 1 4 3 3 3 3 3 3 3 3 3 3 3 4 3 3 4 3 4 4 3 1
#> [149] 3 3 2 3 3 4 3 4 3 3 3 1 4 4 4 4 3 3 4 4 2 4 3 3 4 3 1 3 2 3 1 3 4 2 3 3 3
#> [186] 3 3 1 3 3 3 3 3 3 4 1 2 3 3 3 3 4 3 3 2 2 3 2 3 3 3 3 3 1 3 3 3 3 4 3 1 1
#> [223] 3 3 3 3 3 3 1 2 3 3 4 3 3 3 4 3 3 3 2 3 2 2 3 2 2 3 3 2 3 3 4 2 3 3 3 3 3
#> [260] 3 3 3 3 3 3 3 4 3 2 4 4 2 4 1 3 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 2 3 4 3
#> [297] 3 2 4 4 3 3 2 4 2 3 3 4 3 3 2 3 4 3 3 3 3 4 3 1 3 1 3 2 2 3 3 3 2 2 3 1 3
#> [334] 3 4 3 2 4 3 2 3 3 1 4 3 2 3 3 2 3 3 3 4 1 3 3 3 1 3 3 3 4 4 3 3 4 3 3 3 3
#> [371] 3 4 4 3 3 4 3 4 3 3 4 1 3 3 1 2 3 2 3 3 4 3 3 2 3 3 3 3 3 3 3 3 3 3 4 3 3
#> [408] 3 3 3 4 1 4 3 2 4 4 3 4 3 3 2 3 3 2 3 3 2 4 3 4 3 3 3 3 3 3 3 3 3 2 3 3 3
#> [445] 4 2 2 2 2 3 2 2 3 4 3 3 3 3 3 3 2 3 3 2 3 3 3 1 3 4 3 2 2 4 4 3 3 3 2 4 4
#> [482] 3 3 3 4 1 3 3 3 4 2 1 4 2 1 3 3 3 3 3 3 3
And using glance() to get the tot.withinss.
2.4 Finding Optimal K
# Use purrr to map
kmeans_mapped_tbl <- tibble(centers = 1:30) %>%
mutate(k_means = centers %>% map(kmeans_mapper)) %>%
mutate(glance = k_means %>% map(glance))
kmeans_mapped_tbl# Visualize Scree Plot
kmeans_mapped_tbl %>%
unnest(glance) %>%
ggplot(aes(x = centers, y = tot.withinss)) +
geom_point() +
geom_line() labs(title = "Skree Plot",
subtitle = "Measures the distance each of the symbol is from the closes K-Means center")#> $title
#> [1] "Skree Plot"
#>
#> $subtitle
#> [1] "Measures the distance each of the symbol is from the closes K-Means center"
#>
#> attr(,"class")
#> [1] "labels"
2.5 UMAP Application
#> umap embedding of 502 items in 2 dimensions
#> object components: layout, data, knn, config
# Convert umap results to tibble with symbols
umap_results_tbl <- umap_results$layout %>%
as_tibble() %>%
bind_cols(stock_date_matrix_tbl %>% select(symbol))#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#> ℹ Using compatibility `.name_repair`.
Finally, let’s make a quick visualization of the umap_results_tbl.
- Pipe the
umap_results_tblintoggplot()mapping the columns to x-axis and y-axis - Add a
geom_point()geometry with analpha = 0.5 - Apply
theme_tq()and add a title “UMAP Projection”
We can now see that we have some clusters. However, we still need to combine the K-Means clusters and the UMAP 2D representation.
